Tolerance Rough Set Model Approach to Document Clustering

Authors

  • Tu Bao Ho
  • Ngoc Binh Nguyen
  • Saori Kawasaki
Abstract

1 A tolerance rough set model

Denote the set of $M$ full-text documents by $D$, and the set of $N$ terms from $D$ by $T$. The method generates a hierarchical structure of $D$ in two phases. The first phase extracts and maps each document $d_j \in D$ into a list of terms $t_i$, each of which is assigned a weight that reflects its importance in the document, then enriches documents with their approximations by the proposed tolerance rough set model. The second phase groups documents by an agglomerative clustering method using the document approximations.

The key issue in formulating the tolerance rough set model (TRSM) to represent documents is the identification of tolerance classes of index terms. We employ the co-occurrence of index terms in all documents from $D$ to determine a tolerance relation and tolerance classes. Denote by $f_{d_j}(t_i)$ the number of occurrences of term $t_i$ in $d_j$, by $f_D(t_i)$ the number of documents in $D$ in which term $t_i$ occurs, and by $f_D(t_i, t_j)$ the number of documents in $D$ in which the two index terms $t_i$ and $t_j$ co-occur. We define an uncertainty function $I$ depending on a threshold $\theta$ as

$$I_\theta(t_i) = \{ t_j \mid f_D(t_i, t_j) \geq \theta \} \cup \{ t_i \} \qquad (1)$$

It is clear that the function $I_\theta$ defined above satisfies the conditions $t_i \in I_\theta(t_i)$ and $t_j \in I_\theta(t_i)$ iff $t_i \in I_\theta(t_j)$ for any $t_i, t_j \in T$, and so $I_\theta$ is both reflexive and symmetric. This function corresponds to a tolerance relation $\mathcal{I} \subseteq T \times T$ such that $t_i \,\mathcal{I}\, t_j$ iff $t_j \in I_\theta(t_i)$, and $I_\theta(t_i)$ is the tolerance class of index term $t_i$. A vague inclusion function $\nu$, which determines how much $X$ is included in $Y$, is defined as $\nu(X, Y) = |X \cap Y| / |X|$. Using this, the membership function $\mu$ for $t_i \in T$, $X \subseteq T$ can be defined as $\mu(t_i, X) = \nu(I_\theta(t_i), X) = |I_\theta(t_i) \cap X| / |I_\theta(t_i)|$. With these definitions we can define a tolerance space as $R = (T, I, \nu, P)$, in which the lower approximation $L(R, X)$ and the upper approximation $U(R, X)$ in $R$ of …
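
The tolerance classes and the membership function described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that each document is already reduced to a list of index terms, and that a term joins another term's tolerance class when the two co-occur in at least `theta` documents (the threshold symbol and the comparison are read from the abstract's context). The names `tolerance_classes`, `membership`, and the toy corpus `docs` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def tolerance_classes(docs, theta):
    """Return {term: tolerance class}, where the class of t_i holds t_i itself
    plus every term that co-occurs with t_i in at least `theta` documents."""
    cooc = Counter()  # document-level co-occurrence counts f_D(t_i, t_j)
    terms = set()
    for doc in docs:
        unique = set(doc)  # count each term once per document
        terms |= unique
        for a, b in combinations(sorted(unique), 2):
            cooc[(a, b)] += 1

    classes = {t: {t} for t in terms}  # reflexive: t_i is in its own class
    for (a, b), count in cooc.items():
        if count >= theta:
            classes[a].add(b)  # symmetric: add the pair in both directions
            classes[b].add(a)
    return classes


def membership(term, X, classes):
    """Membership of `term` in the term set X: |I(t) & X| / |I(t)|."""
    cls = classes[term]
    return len(cls & set(X)) / len(cls)


# Toy corpus (hypothetical data): each document is a list of index terms.
docs = [
    ["rough", "set", "cluster"],
    ["rough", "set", "tolerance"],
    ["cluster", "document"],
    ["rough", "tolerance"],
]
classes = tolerance_classes(docs, theta=2)
print(sorted(classes["rough"]))  # ['rough', 'set', 'tolerance']
print(membership("set", {"rough", "set"}, classes))  # 1.0
```

With `theta=2`, only the pairs (rough, set) and (rough, tolerance) co-occur often enough, so "cluster" and "document" keep singleton classes. Raising `theta` shrinks the classes toward singletons, while lowering it merges more terms into each class.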


Related resources

A Tolerance Rough Set Approach to Clustering Web Search Results

The two most popular approaches to facilitating the search for information on the web are web search engines and web directories. Although the performance of search engines is improving every day, searching on the web can be a tedious and time-consuming task due to the huge size and highly dynamic nature of the web. Moreover, the user’s “intention behind the search” is not clearly expres...


Document Clustering with Similarity Rough Set Model

Ho et al. proposed a tolerance rough set model (TRSM) for representing documents and successfully applied it to document clustering. In this paper we analyze their algorithm to point out its drawback. We introduce the similarity rough set model (SRSM) as another model for representing documents in document clustering. The model has been evaluated by experiments on a test collection.


Rough Document Clustering and The Internet

Searching for information on the web has attracted many research communities. Due to the enormous size of the web and the low precision of user queries, finding the right information on the web is a difficult or even impossible task. Clustering, one of the most fundamental tools in Granular Computing (GrC), offers an interesting approach to this problem. By grouping of similar documents, cl...


Clustering Documents with Large Overlap of Terms into Different Clusters based on Similarity Rough Set Model

The similarity rough set model for document clustering (SRSM) uses a generalized rough set model based on a similarity relation and term co-occurrence to group the documents in a collection into clusters. The model is extended from the tolerance rough set model (TRSM) (Ho and Funakoshi, 1997). The SRSM methods have been evaluated, and the results showed that they perform better than TRSM. However, in document...


Improving Quality of Search Results Clustering with Approximate Matrix Factorisations

In this paper we show how approximate matrix factorisations can be used to organise document summaries returned by a search engine into meaningful thematic categories. We compare four different factorisations (SVD, NMF, LNMF and K-Means/Concept Decomposition) with respect to topic separation capability, outlier detection and label quality. We also compare our approach with two other clustering ...




Publication date: 2007